Web Mining for an Amharic - English Bilingual Corpus

Authors

  • Atelach Alemu Argaw
  • Lars Asker
Abstract

We present recent work aimed at constructing a bilingual corpus consisting of comparable Amharic and English news texts. The Amharic and English texts were collected from an Ethiopian news agency that publishes daily news in Amharic and English on its web page. The Amharic texts are represented using Ethiopic script and archived according to the Ethiopian calendar. The overlap between the corresponding Amharic and English news texts in the archive is comparatively small: only approximately one article in ten has a corresponding translated version. Thus a major part of the work has been identifying the subset of matching news texts in the archive, transliterating the Amharic texts into an ASCII representation, and aligning them with their corresponding English versions. In doing so, we utilised a number of available software tools and data sources that were (mainly) found on the Internet. Amharic is a language for which very few computational linguistic tools or corpora (such as electronic lexica, part-of-speech taggers, parsers or tree-banks) exist. A challenge has therefore been to show that it is possible to create a comparable corpus even in the absence of these resources. We used fuzzy string matching between words in the English and Amharic titles as a way to determine how likely it is that two news items refer to the same event. In order to restrict the matching algorithm further, we only compared titles of news items that were published on the same date and at the same place. We present an experimental evaluation of the algorithm, based on data from one year, and show that fuzzy string matching of news titles can be sufficient to align Amharic and English news texts with relatively high precision despite the obvious difference between the two languages.
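
The title-matching idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the similarity measure, so Python's standard `difflib.SequenceMatcher` is swapped in here as a stand-in, and the threshold value, the function names, and the `date`/`place`/`title` fields are all hypothetical. The Amharic titles are assumed to be already transliterated into ASCII and their dates already converted to the same calendar as the English items.

```python
from difflib import SequenceMatcher


def word_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two words (stand-in measure, not the paper's)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def title_score(amharic_title: str, english_title: str, threshold: float = 0.7) -> int:
    """Count transliterated Amharic words that fuzzily match some English title word.

    The threshold of 0.7 is an arbitrary illustrative value.
    """
    english_words = english_title.split()
    return sum(
        1
        for am_word in amharic_title.split()
        if any(word_similarity(am_word, en_word) >= threshold for en_word in english_words)
    )


def best_match(amharic_item: dict, english_items: list, threshold: float = 0.7):
    """Return the best-scoring English item with the same date and place, or None.

    Items are plain dicts with hypothetical keys "date", "place" and "title".
    """
    # Restrict candidates to items published on the same date and at the same place.
    candidates = [
        item
        for item in english_items
        if item["date"] == amharic_item["date"] and item["place"] == amharic_item["place"]
    ]
    scored = [(title_score(amharic_item["title"], item["title"], threshold), item) for item in candidates]
    scored = [pair for pair in scored if pair[0] > 0]  # drop candidates with no matching word
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


# Invented toy data, for illustration only.
amharic_item = {"date": "2004-05-01", "place": "Addis Ababa", "title": "adis abeba polis"}
english_items = [
    {"date": "2004-05-01", "place": "Addis Ababa", "title": "Addis Ababa police report"},
    {"date": "2004-05-01", "place": "Addis Ababa", "title": "Coffee exports rise"},
    {"date": "2004-05-02", "place": "Addis Ababa", "title": "Addis Ababa police report"},
]
match = best_match(amharic_item, english_items)
```

Matches of this kind tend to come from proper names and loan words, which survive transliteration in recognisable form; the date and place filter keeps the number of title comparisons per item small.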

Similar resources

Data-driven Amharic-English Bilingual Lexicon Acquisition

This paper describes a simple statistical language modelling approach to bilingual lexicon acquisition from Amharic-English parallel corpora. The goal is to induce a seed translation lexicon from sentence-aligned corpora. The seed translation lexicon contains matches of Amharic lexemes to weakly inflected English words. Purely statistical measures of term distribution are used as the basis ...

Mining a News Archive for a Comparable Corpus

The emergence of the World Wide Web has provided new and important opportunities to easily access and combine data and information from several different sources, thereby enabling the construction of new resources for researchers everywhere. In this paper we describe recent work that we have done in order to construct a comparable corpus consisting of Amharic and English news texts. In doing...

Bilingual Experiments with an Arabic-English Corpus for Opinion Mining

Recently, Opinion Mining (OM) has been receiving more attention due to the abundance of forums, blogs, e-commerce web sites, news reports and additional web sources where people tend to express their opinions. There are a number of works on Sentiment Analysis (SA) that study the task of identifying polarity, that is, whether the opinion expressed in a text about a given topic is positive or negative. Howe...

Dictionary-based Amharic - English Information Retrieval

We present two approaches to the Amharic – English bilingual track in CLEF 2004. Both experiments use a dictionary-based approach to translate the Amharic queries into English bags of words, but while one approach removes non-content-bearing words from the Amharic queries based on their IDF value, the other uses a list of English stop words to perform the same task. The resulting translated (En...

Amharic-English Information Retrieval with Pseudo Relevance Feedback

We describe cross-language retrieval experiments using Amharic queries and an English-language document collection from our participation in the bilingual ad hoc track at CLEF 2007. Two monolingual and eight bilingual runs were submitted. The bilingual experiments varied in their use of long and short queries, the presence of pseudo relevance feedback (PRF), and three approaches (max...

Publication date: 2005